Abstract Belief compression improves the tractability of large-scale partially
observable Markov decision processes (POMDPs) by finding projections from
high-dimensional belief space onto low-dimensional approximations, where solving
to obtain action selection policies requires fewer computations. This paper
develops a unified theoretical framework to analyse three existing linear belief
compression approaches, including value-directed compression and two
nonnegative matrix factorisation (NMF) based algorithms. The results indicate
that all three known belief compression methods have their own critical
deficiencies. Therefore, projective NMF belief compression (P-NMF) is proposed,
aiming to overcome the drawbacks of the existing techniques. The
performance of the proposed algorithm is examined on four POMDP problems
of reasonably large scale, in comparison with existing techniques. Additionally,
the competitiveness of belief compression is compared empirically to a
state-of-the-art heuristic search-based POMDP solver and their relative merits in
solving large-scale POMDPs are investigated.

Keywords Belief Compression • POMDP • Nonnegative Matrix Factorisation 


1 Introduction 

Making decisions in dynamic environments is one of the core problems of
artificial intelligence. In many cases, it requires not only evaluation of the
reward or cost of the immediate action, but also consideration of the long-term
effect of a sequence of choices made in the future. If the true states
of the system can be perfectly identified, Markov decision processes (MDPs)
can be used to handle such planning problems efficiently. However, in
real-world applications, the true system states are not always fully observable.
Uncertainties may come from different sources, e.g. noisy sensors in a robotic
system or speech recognition errors in a spoken dialogue system, which
motivates the utilisation of probabilistic techniques to track the system states.

The partially observable Markov decision process (POMDP) has been proven
to be a powerful tool for modelling sequential decision making problems under
uncertainty. It generalises the standard MDP to the case where an agent cannot
directly observe the underlying states but has to maintain a probability
distribution (called a belief) over all possible states based on noisy
observations. The optimal policy of a POMDP then specifies an action for each
possible belief to maximise expected discounted future reward.

However, the exact solution for the policy optimisation problem of a POMDP
is computationally intractable (Cassandra 1998). Polynomial-time approximation
algorithms can be achieved using point-based value iteration (PBVI)
techniques (Pineau et al 2003), which successively estimate the value function
by updating the value and its gradient only at the points of a witness point
set. In this case, the dimensionality of the belief space dominates the efficiency
of the algorithms (see Pineau et al 2003; Smith and Simmons 2005; Spaan
and Vlassis 2005).


Factored POMDPs explored in various previous studies (see below) provide
a general direction for improving the tractability of large-scale problems
via dimension reduction. The essential idea behind the factorisation is to
decompose the original instantiations of the state, action and observation
variables in a POMDP into their respective smaller sets of factor variables. Then
conditional independence or context-specific independence among those factor
variables can be exploited to achieve a more compact representation, using
corresponding techniques such as dynamic Bayesian networks (Hoey et al 2010;
Thomson and Young 2010; Williams et al 2005), decision trees (Boutilier
and Poole 1996; Boutilier et al 2000) or algebraic decision diagrams (Hansen
and Feng 2000; Shani et al 2008). Unfortunately, such factored representations
do not necessarily result in efficient policy implementations. Although
for particular types of POMDP problems we will also be able to express the
transition, observation and reward functions in a compact form with respect to
their respective factor variables and optimise the policies in lower-dimensional
spaces (Ong et al 2010; Poupart 2005; Sim et al 2008), this method does not
generalise to all domains by default.

Belief compression provides an alternative solution to reduce the computation
cost for POMDP policy optimisation by projecting the high-dimensional
belief space into a low-dimensional one, using an automatically obtained
projection basis. Main contributions in this area include exponential family
principal component analysis (EPCA)-based compression (?), value-directed
compression (VDC) (Poupart 2005; Poupart and Boutilier 2002) and nonnegative
matrix factorisation (NMF)-based compression (Li et al 2007; Theocharous
and Mahadevan 2010). The EPCA approach achieves a non-linear compression
by exploiting the sparsity of the belief space based on sampled beliefs,
whilst VDC produces a linear projection in a value-directed manner such that
a belief and its compression will obtain an (approximately) identical value.
Due to the nature of linear projection, VDC has the advantage that the
piecewise linear and convex (PWLC) property of the value function remains after
the compression, hence PBVI algorithms can be directly applied to solve the
compressed POMDPs. Benefiting from the insights behind both EPCA and
VDC, the NMF-based algorithms seek a projection basis (also based on sampled
beliefs) that yields a low-rank approximation of the belief space, and use
it to construct a linear compression to preserve convenience for PBVI.

In this paper, after reviewing some background knowledge of POMDPs
(§2), we develop a unified theoretical framework to analyse linear belief
compression algorithms in general (§3). To the best of our knowledge, such an
analysis has not been reported before. After this, the results are employed to
examine three existing linear POMDP compression algorithms separately, including
VDC (Poupart 2005), orthogonal NMF (O-NMF) compression (Li et al 2007),
and locality preserving NMF (LP-NMF) compression (Theocharous and
Mahadevan 2010), which are the only three linear belief compression methods
that we are aware of. Our findings show that all three existing models
have their own critical deficiencies. For VDC (§4), not only can the compressed
value function violate the contractive property of a valid Bellman recursion
(Bellman 1957), and therefore diverge to infinity in the worst case (even
when the compression error is extremely small), but also the lack of
nonnegativity constraints on its compression basis can confuse the pruning procedure
in PBVI and drive the algorithm to a poorly converged point. On the other
hand, both the O-NMF and LP-NMF approaches share the common drawback
that compression error does not directly relate to value loss (§5), so
that good compressions do not necessarily lead to good policies.
Therefore, a novel projective NMF belief compression algorithm (P-NMF) is
proposed (§6), aiming to remedy the deficiencies of the existing techniques.
Experimental results on four POMDP problems of reasonably large scale show
that the proposed model outperforms the existing techniques (§7.1). In
addition, we also investigate the practical effectiveness of belief compression in
solving large-scale POMDPs, in comparison with a state-of-the-art
(uncompressed) POMDP solver called SARSOP (Kurniawati et al 2008) (§7.2), before
we conclude (§8).


2 POMDP Basics 

A POMDP is a tuple $\langle S, A, Z, T, \Omega, R, \eta \rangle$, where the
components are defined as follows. $S$, $A$ and $Z$ are the sets of states,
actions and observations respectively. The transition function $T(s'|s,a)$
defines the conditional probability of transiting from state $s \in S$ to state
$s' \in S$ after taking action $a \in A$. The observation function
$\Omega(z|s,a)$ gives the probability of the occurrence of observation
$z \in Z$ in state $s$ after taking action $a$. $R(s,a)$ is the reward function
specifying the immediate reward of a state-action pair. Finally, $0 < \eta < 1$
is a discount factor. In this paper, we will focus on POMDPs with discrete state,
action, and observation spaces.

A standard POMDP operates as follows. At each time step, the system is
in an unobservable state $s$, for which only an observation $z$ can be received.
A distribution over all possible states is therefore maintained, called a belief,
denoted by $b$, where the probability of the system being in state $s$ is $b(s)$.
Based on the current belief, the system selects an action $a$, receives a reward
$R(s,a)$ and transits to a new (unobservable) state $s'$ where it receives an
observation $z'$. Then the belief is updated to $b'$ based on $z'$ and $a$ as
follows:

$$ b'(s') = \Pr(s'|z',a,b) = \frac{1}{\Pr(z'|a,b)}\, \Omega(z'|s',a) \sum_{s} T(s'|a,s)\, b(s) \tag{1} $$

where $\Pr(z'|a,b) = \sum_{s'} \Omega(z'|s',a) \sum_{s} T(s'|a,s)\, b(s)$ is a
normalisation factor.
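
The belief update in Eq. (1) can be illustrated with a minimal sketch. The nested-list layout `T[a][s][s2]` and `O[a][s2][z]`, the function name `belief_update` and the two-state toy numbers are assumptions made purely for illustration; they are not taken from any of the benchmark problems discussed in this paper.

```python
def belief_update(b, a, z, T, O):
    """Return b' as in Eq. (1) after taking action a and observing z.

    Assumed layout (illustrative only):
      T[a][s][s2] = T(s2 | a, s)      transition probability
      O[a][s2][z] = Omega(z | s2, a)  observation probability
    """
    n = len(b)
    # Unnormalised posterior: Omega(z|s',a) * sum_s T(s'|a,s) b(s)
    unnorm = [
        O[a][s2][z] * sum(T[a][s][s2] * b[s] for s in range(n))
        for s2 in range(n)
    ]
    norm = sum(unnorm)  # this is Pr(z | a, b), the normalisation factor
    if norm == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [u / norm for u in unnorm]


# Hypothetical two-state, one-action, two-observation toy model.
T = [[[0.9, 0.1], [0.2, 0.8]]]
O = [[[0.7, 0.3], [0.4, 0.6]]]
b_next = belief_update([0.5, 0.5], 0, 0, T, O)
```

The returned vector is again a probability distribution, so the update can be applied repeatedly along a trajectory of actions and observations.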


2.1 Policy and Value Function 


A policy $\pi$ is defined as a mapping that maps each belief $b$ to an action
$a = \pi(b)$. The value function of a given policy $\pi$ and given starting point
$b_0$ is the expected sum of discounted rewards, calculated as:

$$ V^{\pi}(b_0) = E\left[ \sum_{t=0}^{n} \eta^{t}\, r_{\pi(b_t)}(b_t) \right] \tag{2} $$

where $n$ is the planning horizon (possibly infinite) and $r_{\pi(b_t)}(b_t)$ is
the immediate reward obtained at time $t$ using policy $\pi$. The objective of
POMDP-based planning is to determine an optimal policy
$\pi^* = \arg\max_{\pi} V^{\pi}(b)$ that maximises the value function. The value
function corresponding to $\pi^*$ is usually denoted by $V^*$.


2.2 Value Iteration 


Commonly used policy optimisation algorithms include value iteration, policy
iteration and linear programming. In this paper, we will focus on value
iteration related techniques. Exact value iteration recursively computes the
optimal value function $V^*$ as the sequence of value functions $V_n$ starting
from an initial $V_0$:

$$ V_{n+1}(b) = \max_{a \in A} \Big[ R(b,a) + \eta \sum_{z \in Z} \Pr(b'|b,a,z)\, \Pr(z|a,b)\, V_n(b') \Big] \tag{3} $$

where $\Pr(b'|b,a,z)$ is an indicator of $b$ updating to $b'$ on action $a$ and
observation $z$. If we represent $R$ in matrix form, where
$R(b,a) = b^\top R_{\cdot,a}$, and define the mapping $T^{a,z}$ such that
$T^{a,z}_{ij} = T(s_j|a,s_i)\, \Omega(z|a,s_j)$, Eq. (3) can be re-written as:

$$ V_{n+1}(b) := \max_{a \in A} \Big[ b^\top R_{\cdot,a} + \eta \sum_{z \in Z} V_n(b^\top T^{a,z}) \Big] \tag{4} $$

where the factor $\Pr(z|a,b)$ is absorbed into the unnormalised successor belief
$b^\top T^{a,z}$. Such a recursion is usually expressed in functional form:

$$ V_{n+1} = H V_n \tag{5} $$

where $H$ is the Bellman backup operator (Bellman 1957). Eq. (5) is also called
a Bellman equation or a Bellman recursion.

The above recursion converges to $V^*$, and can be represented as a piecewise
linear convex (PWLC) function:

$$ V(b) = \max_{\alpha \in \Gamma} b^\top \alpha \tag{6} $$

where $\Gamma$ is a set of vectors called $\alpha$-vectors, with each
$\alpha$-vector associated with an action, such that the action corresponding to
the $\alpha$-vector maximising the value function at the current belief $b$ is
the one executed by the underlying policy $\pi$.
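
As a small illustration of Eq. (6), the sketch below evaluates a PWLC value function and reads off the greedy action. The representation of $\Gamma$ as a list of `(action, alpha)` pairs, the function name `value_and_action` and the toy vectors are assumptions for illustration only.

```python
def value_and_action(b, alpha_vectors):
    """Evaluate V(b) = max_alpha b^T alpha and return the maximiser's action.

    alpha_vectors: list of (action, alpha) pairs, one per alpha-vector
    (an assumed, illustrative data layout).
    """
    best_value, best_action = None, None
    for action, alpha in alpha_vectors:
        v = sum(bi * ai for bi, ai in zip(b, alpha))
        # Strict comparison: ties are resolved in favour of the earlier vector.
        if best_value is None or v > best_value:
            best_value, best_action = v, action
    return best_value, best_action


# Hypothetical alpha-vector set over two states.
gamma = [
    ("listen", [1.0, 1.0]),
    ("open-left", [4.0, -2.0]),
    ("open-right", [-2.0, 4.0]),
]
value, action = value_and_action([0.9, 0.1], gamma)
tie_value, tie_action = value_and_action([0.5, 0.5], gamma)
```

Because each $\alpha$-vector is linear in $b$, the maximum over a finite set is convex and piecewise linear, which is the property that linear compression is designed to preserve.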

Value iteration can then be implemented as a dynamic programming procedure
to iteratively construct the $\alpha$-vectors. Furthermore, the exact Bellman
backup operator $H$ can be approximated with tractable computations by
considering only a finite set of sampled belief points instead of the entire
reachable belief space. This is known as point-based value iteration (PBVI).
Detailed introductions to PBVI algorithms are omitted in this paper, but some
commonly used techniques can be found in (Pineau et al 2003; Smith and Simmons
2005; Spaan and Vlassis 2005).


3 Linear Belief Compression 

We start the discussion from an ideal case where we assume that a lossless
compression is achievable. Linear belief compression can be summarised as
finding a linear function $F \in \mathbb{R}^{n \times k}$ such that:

$$ R = F \tilde{R} \quad \text{and} \quad T^{a,z} F = F \tilde{T}^{a,z}, \quad \forall a \in A, z \in Z \tag{7} $$

where $n = |S|$ is the dimension of the state space in the original POMDP,
$k \ll n$ is the compressed state space size, and $\tilde{R}$ and
$\tilde{T}^{a,z}$ are the compressed reward and transition matrices, which can be
computed as:

$$ \tilde{R} = F^{\dagger} R \quad \text{and} \quad \tilde{T}^{a,z} = F^{\dagger} T^{a,z} F, \quad \forall a \in A, z \in Z \tag{8} $$

where $F^{\dagger} \in \mathbb{R}^{k \times n}$ is some form of 'inverse' of $F$.
Here we temporarily keep this notation general, and leave its specification for
different models to later sections. Let $\tilde{b}$ be the compressed belief,
with:

$$ \tilde{b}^\top = b^\top F \tag{9} $$

For a given policy $\pi$, the value function defined for the compressed problem
can then be written as:

$$ \tilde{V}^{\pi}(\tilde{b}) = \tilde{b}^\top \tilde{R}_{\cdot,\pi(\tilde{b})} + \eta \sum_{z} \tilde{V}^{\pi}(\tilde{b}^\top \tilde{T}^{\pi(\tilde{b}),z}) \tag{10} $$

The underlying theory behind lossless linear belief compression was initially
proposed by Poupart and Boutilier (2002). We quote their theorem and proof
here for convenience of further discussion.


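Theorem 1 (Poupart and Boutilier 2002)
Let $B$ denote the set of all reachable beliefs for a POMDP. Let $F$,
$\tilde{R}$ and $\tilde{T}^{a,z}$ satisfy Eq. (7), then
$\tilde{V}^{\pi}(\tilde{b}) = V^{\pi}(b)$, $\forall \pi, b \in B$.

Proof Base case: let $V_0^{\pi}(b) = b^\top R_{\cdot,\pi(b)}$ and
$\tilde{V}_0^{\pi}(\tilde{b}) = \tilde{b}^\top \tilde{R}_{\cdot,\pi(\tilde{b})}$, then

$$ V_0^{\pi}(b) = b^\top F \tilde{R}_{\cdot,\pi(\tilde{b})} = \tilde{V}_0^{\pi}(\tilde{b}) $$

Induction: let $\tilde{V}_n^{\pi}(\tilde{b}) = V_n^{\pi}(b)$ with $n$
stages-to-go, then

$$
\begin{aligned}
V_{n+1}^{\pi}(b) &= b^\top R_{\cdot,\pi(b)} + \eta \sum_{z} V_n^{\pi}(b^\top T^{\pi(b),z}) \\
&= b^\top R_{\cdot,\pi(b)} + \eta \sum_{z} \tilde{V}_n^{\pi}(b^\top T^{\pi(b),z} F) && : V_n^{\pi}(b') = \tilde{V}_n^{\pi}(\tilde{b}') \\
&= b^\top F \tilde{R}_{\cdot,\pi(\tilde{b})} + \eta \sum_{z} \tilde{V}_n^{\pi}(b^\top F \tilde{T}^{\pi(\tilde{b}),z}) && : \text{substituting Eq. (7)} \\
&= \tilde{b}^\top \tilde{R}_{\cdot,\pi(\tilde{b})} + \eta \sum_{z} \tilde{V}_n^{\pi}(\tilde{b}^\top \tilde{T}^{\pi(\tilde{b}),z}) = \tilde{V}_{n+1}^{\pi}(\tilde{b}) && : \text{substituting Eq. (9)} \qquad \square
\end{aligned}
$$

Theorem 1 shows that if the conditions in Eq. (7) hold, all policies have
identical values with respect to the compressed and uncompressed POMDPs, i.e.
the compression is lossless.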


3.1 A Complementary Theory of Lossless Belief Compression 


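Recall the $\alpha$-vector representation of the value function in Eq. (6). For
a given policy $\pi$, we can express $V^{\pi}(b) = b^\top \alpha^{\pi}$, where we
use $\alpha^{\pi}$ to denote the $\alpha$-vector specified by $\pi$ in computing
the value of the current $b$. A similar representation is applicable to
$\tilde{V}^{\pi}$ as well. (For instance
$\tilde{V}^{\pi}(\tilde{b}) = \tilde{b}^\top \tilde{\alpha}^{\pi}$.) Then by
substituting Eq. (8) and Eq. (9) into Eq. (10), the value function of the
compressed POMDP can be explicitly expressed in the following form:

$$ \tilde{V}^{\pi}(\tilde{b}) = \tilde{b}^\top \tilde{\alpha}^{\pi} = b^\top F \tilde{\alpha}^{\pi} \tag{11} $$

$$ \phantom{\tilde{V}^{\pi}(\tilde{b})} = b^\top F F^{\dagger} R_{\cdot,\pi(\tilde{b})} + \eta \sum_{z} b^\top F F^{\dagger} T^{\pi(\tilde{b}),z} F \tilde{\alpha}^{\pi,z} \tag{12} $$

where $\tilde{\alpha}^{\pi,z}$ denotes the $\alpha$-vector selected at the
successor belief reached under observation $z$. Let
$\hat{\alpha}^{\pi} = F \tilde{\alpha}^{\pi}$ and $A = F F^{\dagger}$, and
substitute them into Eq. (11) and (12). We obtain:

$$ \hat{V}^{\pi}(b) = b^\top \hat{\alpha}^{\pi} = b^\top A R_{\cdot,\pi(b)} + \eta \sum_{z} b^\top A T^{\pi(b),z} \hat{\alpha}^{\pi,z} \tag{13} $$

Similar to the definition of the original value function $V$, the recursive
function $\hat{V}$ can also be written in functional form as:

$$ \hat{V}_{n+1} = \hat{H} \hat{V}_n \tag{14} $$

This means that the linear belief compression defined by Eq. (8) and (9) will
actually result in a modified value function with the original Bellman backup
operator $H$ replaced by an approximated backup operator $\hat{H}$. After this,
we can obtain the following theorem, which is more relaxed than Theorem 1.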

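Theorem 2 Let $A$ be a low-rank square matrix that can be factored into the
product of two rectangular matrices (of rank $k$) as $A = F F^{\dagger}$, and
let $\tilde{R}$ and $\tilde{T}^{a,z}$ be defined as in Eq. (8). If there exists
such an $A$ that satisfies either (i) $R = AR$ and $T^{a,z} = A T^{a,z}$,
$\forall a \in A, z \in Z$, or (ii) $b^\top = b^\top A$, $\forall b \in B$, then
$\tilde{V}^{\pi}(\tilde{b}) = V^{\pi}(b)$, $\forall \pi, b \in B$.

Proof By defining $\hat{V}_0^{\pi}(b) = b^\top A R_{\cdot,\pi(b)}$ and
$V_0^{\pi}(b) = b^\top R_{\cdot,\pi(b)}$, and substituting either condition (i) or
condition (ii), we can obtain $\hat{V}_0^{\pi}(b) = V_0^{\pi}(b)$.
It is straightforward to induce that if $\hat{V}_n^{\pi}(b) = V_n^{\pi}(b)$ with
$n$ stages-to-go, then $\hat{V}_{n+1}^{\pi}(b) = V_{n+1}^{\pi}(b)$, by
substituting either condition (i) or condition (ii) into Eq. (13). Finally,
substituting the definition of $\hat{\alpha}^{\pi}$ into Eq. (11) gives
$\tilde{V}^{\pi}(\tilde{b}) = \hat{V}^{\pi}(b) = V^{\pi}(b)$. $\square$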


3.2 Convergence of Lossy Belief Compression 


The above discussions are essentially based on the ideal assumption that a
lossless compression exists, which is usually not the case in practical POMDP
problems. However, for many problems lossy belief compression can still be
employed to reduce the computational complexity of policy training (and
execution). Lossy belief compression is designed to seek a projection matrix $F$
by minimising some loss criteria, but as a consequence errors will exist between
$V^{\pi}(b)$ and $\tilde{V}^{\pi}(\tilde{b})$. Moreover, such errors may
propagate during value iteration, and result in a significant loss in the quality
of the obtained policy. The error propagation problem has been studied in depth
for MDPs under reinforcement learning scenarios in previous literature (Antos et
al 2008; Farahmand et al 2010; Munos 2007). However, their results do not
directly transfer to POMDP problems due to more complex backup procedures. Hence,
in this paper we only study a basic problem: sufficient conditions for the value
function Eq. (10) under lossy compression to be a valid Bellman equation that
converges monotonically to a fixed point. Such convergence implies a bounded loss
between the original and the compressed value functions.

Since $\tilde{V}^{\pi}(\tilde{b})$ and $\hat{V}^{\pi}(b)$ return the same value
if $\tilde{b}$ is the compression of $b$, it is much easier to investigate the
latter, which only involves adding an extra linear operator $A$ to the original
value function.

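Lemma 1 Let $\hat{b}^\top = b^\top A$ and $\bar{b} = \hat{b} / \|\hat{b}\|_1$
(i.e. normalised $\hat{b}$). For $V$ and $\hat{V}$ defined in Eq. (5) and (14)
respectively, $\hat{H} V(b) = \|b^\top A\|_1\, H V(\bar{b})$.

Proof Writing $\alpha^{a,z}$ for the $\alpha$-vector of $V$ selected at the
successor belief for action $a$ and observation $z$,

$$
\begin{aligned}
\hat{H} V(b) &= \max_{a} \Big[ b^\top A R_{\cdot,a} + \eta\, b^\top A \sum_{z} T^{a,z} \alpha^{a,z} \Big] \\
&= \|b^\top A\|_1 \max_{a} \Big[ \bar{b}^\top R_{\cdot,a} + \eta\, \bar{b}^\top \sum_{z} T^{a,z} \alpha^{a,z} \Big] \\
&= \|b^\top A\|_1\, H V(\bar{b}) \qquad \square
\end{aligned}
$$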


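After this, Lemma 1 can be used to prove the following lemma.

Lemma 2 If $\eta \|A\|_\infty < 1$, $\hat{V}$ defined in Eq. (13) and (14) is
contractive, i.e. for two given value functions $U_1$ and $U_2$ and the recursion
$\hat{H}$ it holds that

$$ \|\hat{H} U_1 - \hat{H} U_2\|_\infty \le \beta \|U_1 - U_2\|_\infty $$

with $0 \le \beta < 1$ and $\|\cdot\|_\infty$ the supremum norm.

Proof

$$
\begin{aligned}
\|\hat{H} U_1 - \hat{H} U_2\|_\infty &= \|\hat{H} U_1(b) - \hat{H} U_2(b)\|_\infty \\
&= \big\| \|b^\top A\|_1 \big( H U_1(\bar{b}) - H U_2(\bar{b}) \big) \big\|_\infty \\
&\le \|A\|_\infty \|H U_1 - H U_2\|_\infty \\
&\le \eta \|A\|_\infty \|U_1 - U_2\|_\infty
\end{aligned}
$$

where we utilise the matrix norm property
$\|b^\top A\|_1 \le \|b\|_1 \|A\|_\infty$, and the facts that $\|b\|_1 = 1$ and,
for a standard Bellman recursion,
$\|H U_1 - H U_2\|_\infty \le \eta \|U_1 - U_2\|_\infty$. $\square$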


The contraction property ensures that the vector space defined by the
compressed value function is complete. Therefore, the space of such value
functions together with the supremum norm forms a Banach space, and the Banach
fixed-point theorem ensures that a single fixed point exists, to which the value
recursion always converges (Puterman 2005).
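
The sufficient condition of Lemma 2 is cheap to check numerically for a given compression. The sketch below is a minimal illustration under the assumption that $\|A\|_\infty$ is the induced infinity norm (maximum absolute row sum), which is the norm for which $\|b^\top A\|_1 \le \|b\|_1 \|A\|_\infty$ holds; the function names and the example matrix are hypothetical.

```python
def inf_norm(A):
    """Induced infinity norm of a matrix: the maximum absolute row sum."""
    return max(sum(abs(x) for x in row) for row in A)


def is_contractive(A, eta):
    """Check the sufficient condition of Lemma 2: eta * ||A||_inf < 1."""
    return eta * inf_norm(A) < 1.0


# Hypothetical A = F F^dagger whose rows sum to slightly more than one.
A = [[0.6, 0.5],
     [0.5, 0.6]]
ok_low_discount = is_contractive(A, 0.9)    # 0.9 * 1.1 is below 1
ok_high_discount = is_contractive(A, 0.95)  # 0.95 * 1.1 exceeds 1
```

The example shows that the same compression can satisfy the condition for one discount factor and fail it for another, so the check depends on both $A$ and $\eta$.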


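Lemma 3 $\hat{V}$ defined in Eq. (13) and (14) is isotonic, i.e. for two given
value functions $U_1$ and $U_2$ and the recursion $\hat{H}$ it holds that

$$ U_1 \le U_2 \Rightarrow \hat{H} U_1 \le \hat{H} U_2 $$

Proof Let $\hat{H} U_1(b) = \hat{H}^{a_1} U_1(b)$ and
$\hat{H} U_2(b) = \hat{H}^{a_2} U_2(b)$, with $a_1$ and $a_2$ denoting the actions
maximising $\hat{H} U_1$ and $\hat{H} U_2$ at point $b$, respectively. Using
this definition, we have $\hat{H}^{a_1} U_2(b) \le \hat{H}^{a_2} U_2(b)$.

Let $b^{a,z} = (b^\top A T^{a,z})^\top$, and let $\alpha_1^{b,a,z}$ and
$\alpha_2^{b,a,z}$ be the $\alpha$-vectors maximising the value functions $U_1$
and $U_2$ at $b^{a,z}$ respectively. The following holds:

$$
\begin{aligned}
U_1 \le U_2 &\Rightarrow U_1(b^{a_1,z}) \le U_2(b^{a_1,z}), \quad \forall b, z \\
&\Rightarrow b^\top A R_{\cdot,a_1} + \eta \sum_{z} b^\top A T^{a_1,z} \alpha_1^{b,a_1,z} \le b^\top A R_{\cdot,a_1} + \eta \sum_{z} b^\top A T^{a_1,z} \alpha_2^{b,a_1,z} \\
&\Rightarrow \hat{H}^{a_1} U_1(b) \le \hat{H}^{a_1} U_2(b) \le \hat{H}^{a_2} U_2(b), \quad \forall b \\
&\Rightarrow \hat{H} U_1(b) \le \hat{H} U_2(b), \quad \forall b \\
&\Rightarrow \hat{H} U_1 \le \hat{H} U_2 \qquad \square
\end{aligned}
$$

The isotonic property of the value function guarantees that value iteration
converges monotonically.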

Nevertheless, the above lemmas only consider exact value backup, which
is intractable in practice. If approximate backups are taken into account,
pruning is an inevitable procedure to ensure an efficient size of the
$\alpha$-vector set (Zhang and Zhang 2005). Moreover, working in the compressed
belief space, the backup operation is based on $\tilde{V}$ instead of
$\hat{V}$. Since an essential step in pruning is to check the domination of an
$\alpha$-vector by others, the following lemma gives a further constraint for a
compressed POMDP to be efficiently solvable.

Lemma 4 Let $\tilde{b}$ be a compressed belief with respect to a compression
function $F$, as defined in Eq. (9). For two arbitrary $\alpha$-vectors
$\tilde{\alpha}_1$ and $\tilde{\alpha}_2$ (corresponding to the compressed value
function), $\tilde{\alpha}_1 \ge \tilde{\alpha}_2 \Rightarrow
\tilde{b}^\top \tilde{\alpha}_1 \ge \tilde{b}^\top \tilde{\alpha}_2$,
$\forall \tilde{b}$, holds for an arbitrary valid POMDP, iff $F \ge 0$.

The proof is straightforward, hence is omitted here.
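
A small numerical counterexample may help to see why nonnegativity matters in Lemma 4. In the sketch below, the matrix `F_neg`, the belief and the two $\alpha$-vectors are hypothetical values chosen only to expose the effect; `compress` implements Eq. (9) and `dominates` is the component-wise test that a pruning step would typically rely on.

```python
def compress(b, F):
    """Compressed belief of Eq. (9): b_tilde^T = b^T F."""
    return [sum(b[i] * F[i][j] for i in range(len(b))) for j in range(len(F[0]))]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def dominates(alpha1, alpha2):
    """Component-wise domination test alpha1 >= alpha2."""
    return all(x >= y for x, y in zip(alpha1, alpha2))


F_neg = [[1.0, -1.0],   # hypothetical compression matrix with a negative entry
         [0.0, 1.0]]
b = [1.0, 0.0]          # a valid belief over two states
alpha1 = [1.0, 1.0]     # alpha1 >= alpha2 component-wise
alpha2 = [1.0, 0.0]

b_tilde = compress(b, F_neg)       # has a negative component
value1 = dot(b_tilde, alpha1)      # value of the "dominating" vector
value2 = dot(b_tilde, alpha2)      # value of the "dominated" vector
```

Here `alpha2` would be pruned as dominated, although it attains the higher value at the compressed belief, which is exactly the failure mode that $F \ge 0$ rules out.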


4 Value-Directed Compression 


Lossless VDC is designed to seek a linear compression function $F$ that
satisfies Eq. (7), which is a more restricted condition compared to our Theorem
2. Poupart (2005) proposed that such an $F$ can be obtained by iteratively
exploring the Krylov subspace $Kr(\{T^{a,z}\}_{a \in A, z \in Z}, R)$ to find a
Krylov basis (with $k$ linearly independent column vectors). We can summarise
this process as (i) initialising $F$ to the linearly independent columns of $R$,
and (ii) in each iteration, multiplying every column ($F_i$) of $F$ with every
$T^{a,z}$ (as $T^{a,z} F_i$), and appending the obtained vector to $F$ if it is
linearly independent of the columns of the current $F$. The process ends when no
more columns can be added.

Algorithm 1: Lossless/Lossy Value-Directed Compression (Poupart 2005, Chapter 4)

  input: $R$, $\{T^{a,z}\}_{a \in A, z \in Z}$, $k$   /* truncation parameter, lossy VDC only */
  $F \leftarrow \emptyset$
  $C \leftarrow R$   /* candidate column set */
  repeat
      $c \leftarrow C[\text{first}]$   /* lossless VDC */
      $c \leftarrow \arg\max_{y \in C} \min_{x} \|y - Fx\|$   /* lossy VDC */
      $F \leftarrow [F, c]$
      for each $(a, z) \in A \times Z$
          $C \leftarrow [C, T^{a,z} c]$
      remove columns linearly dependent on $F$ from $C$
  until $C = \emptyset$ or length($F$) $= k$   /* truncation, lossy VDC only */
  solve Eq. (7) /* lossless VDC */ or Eq. (15) /* lossy VDC */
      to obtain $\tilde{R}$ and $\{\tilde{T}^{a,z}\}_{a \in A, z \in Z}$
  return $F$, $\tilde{R}$, $\{\tilde{T}^{a,z}\}_{a \in A, z \in Z}$
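
The Krylov iteration summarised above can be sketched in a few lines. This is only an illustrative sketch, not the original implementation: the function name `krylov_basis`, the column-list data layout, the Gram-Schmidt residual used as the linear-independence test and the tolerance `tol` are all assumptions made here.

```python
import math


def matvec(M, v):
    """Matrix-vector product M v for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def krylov_basis(R_columns, T_matrices, tol=1e-9):
    """Collect linearly independent vectors of the Krylov subspace.

    R_columns:  list of reward columns R_{.,a} (each a list of length n)
    T_matrices: list of the matrices T^{a,z} (each a list of n rows)
    A candidate is accepted when its residual after projecting out the
    current basis exceeds tol (an assumed, illustrative criterion).
    """
    basis = []   # columns of F, in the order they were accepted
    ortho = []   # orthonormalised copies, used only for the independence test
    candidates = [list(c) for c in R_columns]
    while candidates:
        c = candidates.pop(0)
        residual = list(c)
        for q in ortho:  # modified Gram-Schmidt projection
            coef = sum(qi * ri for qi, ri in zip(q, residual))
            residual = [ri - coef * qi for ri, qi in zip(residual, q)]
        norm = math.sqrt(sum(ri * ri for ri in residual))
        if norm <= tol:
            continue     # linearly dependent on the current F
        ortho.append([ri / norm for ri in residual])
        basis.append(c)
        candidates.extend(matvec(T, c) for T in T_matrices)
    return basis


# Hypothetical 3-state example: constant reward and scaled stochastic T.
T_half = [[0.25, 0.25, 0.0], [0.0, 0.25, 0.25], [0.25, 0.0, 0.25]]
basis_small = krylov_basis([[1.0, 1.0, 1.0]], [T_half])

# Hypothetical example with a cyclic shift, which spans the full space.
T_shift = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
basis_full = krylov_basis([[1.0, 0.0, 0.0]], [T_shift])
```

In the first example the Krylov subspace is one-dimensional, so a lossless compression to a single dimension exists; in the second the basis has as many columns as there are states and no lossless reduction is possible.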

Clearly, a lossless compression is achievable only if the Krylov subspace
is low-rank. In a more general case, it was suggested that one can greedily
select $k$ basis vectors in a forward-search manner to approximately minimise
the residual errors of their predictions on the remaining vectors in the Krylov
subspace, which is known as lossy VDC (Poupart 2005). After obtaining $F$,
$\tilde{R}$ and $\tilde{T}^{a,z}$ can be computed by either solving Eq. (7) for
lossless VDC, or the following regression problem for lossy VDC:

$$ \tilde{R} = \arg\min_{\tilde{R}} \|R - F \tilde{R}\|_F \quad \text{and} \quad \tilde{T}^{a,z} = \arg\min_{\tilde{T}^{a,z}} \|T^{a,z} F - F \tilde{T}^{a,z}\|_F, \quad \forall a \in A, z \in Z \tag{15} $$

where $\|\cdot\|_F$ denotes the Frobenius norm.

For the convenience of future discussions, we list the pseudo-code for
lossless and lossy VDC in Algorithm 1. Note that we adapt the representation
of the lossless VDC algorithm to make it more comparable to lossy VDC,
but it works in exactly the same way as the original algorithm presented in
(Poupart 2005). In standard numerical computation libraries (e.g. LAPACK
(Anderson et al 1999)), an overdetermined linear equation is usually solved as
a least-squares problem, which suggests that Eq. (7) and Eq. (15) are treated
the same in practice, and both lead to a solution in the form of Eq. (8), with
$F^{\dagger}$ being the pseudo-inverse of $F$ in this case. Therefore, comparing
the two VDC algorithms, one can see that the essential difference between them is
their column selection strategy.


4.1 Deficiency of VDC 

Firstly, VDC does not guarantee a nonnegative $F$ unless $R$ is nonnegative.
Therefore, according to Lemma 4, if PBVI is applied to the compressed problem,
some $\alpha$-vectors may be mistakenly pruned, resulting in the algorithm
converging to a non-optimal policy. Secondly, as $F$ is unregularised in VDC,
$\|A\|_\infty$ can be arbitrarily large. In the worst case, the compressed value
function may diverge to infinity. These arguments seem to contradict the proofs
in (Poupart 2005), so we give more detailed explanations as follows.

Theorem 1 is derived by assuming that the compression is lossless. However,
due to numerical errors in practical computations, a totally 'error-free'
solution never really exists. Especially for lossless VDC, because of the lack
of consideration of system conditioning, it tends to be less robust to numerical
errors than lossy VDC.¹ In other words, residual errors always occur in
the compressed value function. Furthermore, although Poupart (2005) also
developed an error bound for lossy VDC:

$$ \|V^* - F \tilde{V}^*\|_\infty \le \frac{\epsilon_R}{1 - \eta} + \frac{\eta\, \epsilon_T}{1 - \eta}\, |Z|\, \|\tilde{V}^*\|_\infty \tag{16} $$

where $\epsilon_R = \|R - F \tilde{R}\|_\infty$ and
$\epsilon_T = \max_{a \in A, z \in Z} \|T^{a,z} F - F \tilde{T}^{a,z}\|_\infty$,
in the case where $\tilde{V}^*$ is a diverging function, the bound itself is
infinitely large.


4.2 Empirical Evidence 


To support our argument on the deficiencies of VDC, we investigated its
performance on two benchmark problems, Coffee (Boutilier and Poole 1996) and
Hallway2 (Littman et al 1995).

In the implementation of VDC, we experiment with two ways of judging
the linear dependence between a newly obtained vector $c$ and the existing
columns of $F$. The first method is to set a threshold $\tau$ for the
least-squares residual $r = \|c - Fw\|_2$, where
$w = \arg\min_{w} \|c - Fw\|_2$. Concretely, $c$ will be appended to $F$ only if
$r > \tau$. The second way of doing this is to check whether
rank($[F, c]$) > rank($F$), with rank($\cdot$) denoting the numerical rank of
a matrix. The quantities $\epsilon_R$ and $\epsilon_T$ as defined in Eq. (16)
are taken as measures of compression error. The compression quality of VDC with
respect to different residual thresholds $\tau$ (for lossless VDC) and truncation
levels $k$ (for lossy VDC) on the two benchmark problems is illustrated in
Figure 1. Note that in the Coffee problem, the Krylov iteration for the
rank-based lossless VDC finishes when 201 columns in $F$ are obtained; however,
there is still a significant residual error $\epsilon_T$ at this point. This is
an example of the numerical instability issue of VDC.

Some example points in Figure 1 are selected to examine the policy quality
obtained from the corresponding compressed POMDPs. To roughly ensure the
same compression level for lossless and lossy VDC, we choose lossless VDC
with $\tau = 10^{-?}$ (221 dimensions) and lossy VDC with $k = 200$ for Coffee,
and lossless VDC with $\tau = 10^{-?}$ (40 dimensions) and lossy VDC with
$k = 40$ for Hallway2. In addition, we evaluate the rank-based VDC (201
dimensions for Coffee and 52 dimensions for Hallway2) and the original POMDPs
for both problems as well. Perseus (Spaan and Vlassis 2005) is employed to solve
the compressed and uncompressed POMDPs here. The expected value growth
for each algorithm during the value iteration procedures is shown in Figure 2,
where all the experiments are based on 5000 sampled belief points. We
also sample 1000 decision trajectories for each learned policy to compute an
average reward, which gives an insight into the actual quality of the obtained
policy, since the expected values for compressed problems can be unreliable
according to our discussion in Section 4.1. The above policy learning and
evaluation procedure is repeated five times for each task, and Table 1 shows
the means and standard deviations of the average rewards.

¹ Detailed analysis on the numerical stabilities of VDC is out of the main scope
of this paper, but can be found in the supplementary material.

Fig. 1 Compression errors of lossless and lossy VDC on Coffee and Hallway2.

Fig. 2 Expected value during value iterations on Coffee and Hallway2.

Coffee      Lossless/rank    Lossless/τ = 10^{-?}    Lossy/k = 200    Uncompressed
Reward      -5.52 ± 2.11     -14.36 ± 2.34           10.88 ± 0.17     11.16 ± 0.14

Hallway2    Lossless/rank    Lossless/τ = 10^{-?}    Lossy/k = 40     Uncompressed
Reward      0.01 ± 0.00      0.23 ± 0.02             0.24 ± 0.04      0.34 ± 0.02

Table 1 Average sampled rewards on Coffee and Hallway2.

These two problems demonstrate all the deficiencies of VDC mentioned
above. Firstly, as shown in Figure 2, the rank-based VDC results in a
diverging value function in both tasks. Especially in Hallway2, even when the
compression error is at the $10^{-??}$ level, it may still cause a significant
(possibly unbounded) loss in the value function. Secondly, as can be seen in the
Coffee problem, the policy learning for lossless VDC with $\tau = 10^{-?}$
converges to an unreasonably high value, but the corresponding average sampled
reward (in Table 1) is extremely low. This is due to the pruning procedure being
confused by the negative elements in $F$. More concretely, some
$\alpha$-vectors that should be dominated by all the others are mistaken for the
dominating ones.

The above theoretical and empirical analysis suggests that the performance
of VDC is not guaranteed in practical usage, hence we do not experiment with it
further in this paper.


5 Orthogonal NMF for Belief Compression 


Orthogonal NMF (O-NMF) based belief compression, introduced by Li et al
(2007), explores an alternative direction by taking advantage of the possible
low-rankness of the reachable belief space $B$ defined by a POMDP. It seeks
a nonnegative factorisation of a sampled set of beliefs
$B = [b_1, b_2, \ldots, b_m]$ subject to an orthogonal constraint, such as:

$$ B \approx F \tilde{B} \quad \text{s.t.} \quad F F^\top = I,\ F \ge 0 \text{ and } \tilde{B} \ge 0 \tag{17} $$

where $\tilde{B}$ denotes the set of the compressed beliefs and $I$ is the
identity matrix. The compressed reward and transition matrices are then
constructed by substituting $F^{\dagger} = F^\top$ into Eq. (8).


5.1 Deficiency of O-NMF Belief Compression

An obvious deficiency of the above formulation is that the orthogonal
constraint can never be satisfied in practice, since the compression matrix $F$
is of low rank. Therefore, the algorithm proposed in (Li et al 2007) actually
only solves the following optimisation problem:

$$ \min_{F \ge 0,\ \tilde{B} \ge 0} \|B - F \tilde{B}\|_F^2 + \lambda \|I - F F^\top\|_F^2 \tag{18} $$

where $\lambda > 0$ is a coefficient balancing the weights of the two loss
functions in the objective.²
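
To make the two terms of Eq. (18) concrete, the sketch below evaluates the objective for a hand-picked factorisation. It assumes squared Frobenius norms for both terms; the helper names and the toy matrices are hypothetical and only serve to show that the orthogonality penalty stays positive when $k < n$, even for a perfect reconstruction.

```python
def matmul(X, Y):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def transpose(X):
    return [list(col) for col in zip(*X)]


def frobenius_sq(X, Y):
    """Squared Frobenius norm of X - Y."""
    return sum((x - y) ** 2 for rx, ry in zip(X, Y) for x, y in zip(rx, ry))


def onmf_objective(B, F, B_tilde, lam):
    """Value of the penalised objective (squared norms assumed)."""
    n = len(F)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    reconstruction = frobenius_sq(B, matmul(F, B_tilde))
    orthogonality = frobenius_sq(identity, matmul(F, transpose(F)))
    return reconstruction + lam * orthogonality


# Hypothetical case: n = 3 states, k = 2, beliefs supported on two states.
F = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
B_tilde = [[0.3, 1.0], [0.7, 0.0]]
B = matmul(F, B_tilde)            # reconstruction error is exactly zero
objective = onmf_objective(B, F, B_tilde, lam=0.5)
```

With this choice the reconstruction term vanishes, yet $F F^\top$ is a rank-2 projector rather than the $3 \times 3$ identity, so the objective equals $\lambda$ times the remaining orthogonality error.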

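Nevertheless, when compared to VDC, O-NMF belief compression has the
advantage that the nonnegative $F$ ensures that PBVI works properly (Lemma
4). Moreover, minimising the Frobenius norm difference between $F F^\top$ and
$I$ approximately controls the scale of $\|F F^\top\|_\infty$ due to the
equivalence of norms in finite dimension, which preserves the convergence of the
compressed value function (Lemma 2). The fact that essentially makes O-NMF
belief compression work can be understood as follows. Assume we have a
compressed belief $\tilde{b}$ such that the original belief $b = F \tilde{b}$.
Then the compressed value function for a given policy $\pi$ can be computed as:

$$
\begin{aligned}
\tilde{V}^{\pi}(\tilde{b}) &= \tilde{b}^\top \tilde{R}_{\cdot,\pi(\tilde{b})} + \eta \sum_{z} \tilde{V}^{\pi}(\tilde{b}^\top \tilde{T}^{\pi(\tilde{b}),z}) \\
&= \tilde{b}^\top F^\top R_{\cdot,\pi(b)} + \eta \sum_{z} \tilde{V}^{\pi}(\tilde{b}^\top F^\top T^{\pi(b),z} F) \\
&= b^\top R_{\cdot,\pi(b)} + \eta \sum_{z} \tilde{V}^{\pi}(b^\top T^{\pi(b),z} F)
\end{aligned}
\tag{19}
$$

Note that although the compressed belief $\tilde{b}$ here is not obtained as in
Eq. (9), the corresponding $\tilde{V}$ can still be related to $V$. Letting
$b'^\top = b^\top T^{\pi(b),z}$ and $\tilde{b}'^\top = b'^\top F$, we have:

$$ V^{\pi}(b) - \tilde{V}^{\pi}(\tilde{b}) = \eta \sum_{z} \big[ V^{\pi}(b') - \tilde{V}^{\pi}(\tilde{b}') \big] \tag{20} $$

Going further, we can prove the following bound on the difference between $V$
and $\hat{V}$.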

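Theorem 3 For value functions $V$ and $\hat{V}$ defined in Eq. (5) and Eq. (13)
respectively, if $\eta \|A\|_\infty < 1$, then the following bound holds:

$$ \|V - \hat{V}\|_\infty \le \frac{\|I - A\|_\infty}{1 - \eta \|A\|_\infty} \big( \|R\|_\infty + \eta |Z| \|V^*\|_\infty \big) $$

Proof

$$
\begin{aligned}
\|V - \hat{V}\|_\infty &= \max_{\pi} \|V^{\pi} - \hat{V}^{\pi}\|_\infty \\
&\le \max_{\alpha, \hat{\alpha}, a} \Big\| R_{\cdot,a} + \eta \sum_{z} T^{a,z} \alpha - A R_{\cdot,a} - \eta \sum_{z} A T^{a,z} \hat{\alpha} \Big\|_\infty \\
&= \max_{\alpha, \hat{\alpha}, a} \Big\| (I - A) R_{\cdot,a} + \eta \sum_{z} T^{a,z} \alpha - \eta \sum_{z} A T^{a,z} \alpha + \eta \sum_{z} A T^{a,z} \alpha - \eta \sum_{z} A T^{a,z} \hat{\alpha} \Big\|_\infty \\
&\le \|(I - A) R\|_\infty + \max_{\alpha, a} \Big\| \eta (I - A) \sum_{z} T^{a,z} \alpha \Big\|_\infty + \max_{\alpha, \hat{\alpha}, a} \Big\| \eta \sum_{z} A T^{a,z} (\alpha - \hat{\alpha}) \Big\|_\infty \\
&\le \|I - A\|_\infty \|R\|_\infty + \eta |Z| \|I - A\|_\infty \|V^*\|_\infty + \eta \|A\|_\infty \|V - \hat{V}\|_\infty
\end{aligned}
$$

Rearranging the last inequality gives

$$ \|V - \hat{V}\|_\infty \le \frac{\|I - A\|_\infty}{1 - \eta \|A\|_\infty} \big( \|R\|_\infty + \eta |Z| \|V^*\|_\infty \big) $$

where we apply Lemma 2 and the facts that $\|T^{a,z}\|_\infty \le 1$ and
$\|\alpha\|_\infty \le \|V^*\|_\infty$. $\square$

² Although Li et al (2007) suggested that with a particular way of selecting
$\lambda$, Eq. (18) is equivalent to Eq. (17), it is well-known that
$F F^\top = I$ is impossible for any matrix $F \in \mathbb{R}^{n \times k}$ with
$k < n$, as rank($F F^\top$) = rank($F$) $\le k < n$ = rank($I$).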

Theorem 3 implies that the O-NMF method (with $A = F F^\top$) minimises
the upper bound of the value loss caused by compression (as
$\|I - A\|_\infty \le \sqrt{n}\, \|I - A\|_F$). However, such a bound can be
quite loose in practice. The drawback of O-NMF belief compression is that it does
not directly relate the error in compressing a belief $b$ to the loss in its
corresponding compressed value function, because Eq. (20) indicates that even if
$b = F \tilde{b}$ holds, $V^{\pi}(b) - \tilde{V}^{\pi}(\tilde{b}) \ne 0$ since
$V^{\pi}(b') = \tilde{V}^{\pi}(\tilde{b}')$ does not hold in general for the
successor beliefs $b'$ transited to from $b$.
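
To give a feeling for how loose the bound of Theorem 3 can be, the sketch below evaluates its right-hand side, as read from the last line of the proof, for scalar inputs. The function name, argument names and the numbers are hypothetical; the norms are assumed to have been computed beforehand.

```python
def value_loss_bound(i_minus_a_norm, a_norm, eta, r_norm, num_obs, v_star_norm):
    """Right-hand side of the Theorem 3 bound.

    i_minus_a_norm: ||I - A||_inf
    a_norm:         ||A||_inf
    r_norm:         ||R||_inf
    num_obs:        |Z|
    v_star_norm:    ||V*||_inf
    """
    if eta * a_norm >= 1.0:
        raise ValueError("the bound requires eta * ||A||_inf < 1")
    scale = i_minus_a_norm / (1.0 - eta * a_norm)
    return scale * (r_norm + eta * num_obs * v_star_norm)


# Hypothetical numbers: A = 0.5 * I, so ||A||_inf = ||I - A||_inf = 0.5.
bound = value_loss_bound(0.5, 0.5, eta=0.9, r_norm=1.0, num_obs=2, v_star_norm=10.0)

# A lossless compression with A = I gives ||I - A||_inf = 0 and a zero bound.
bound_lossless = value_loss_bound(0.0, 1.0, eta=0.9, r_norm=1.0, num_obs=2, v_star_norm=10.0)
```

In the first case the bound is larger than $\|V^*\|_\infty$ itself, which illustrates why a small $\|I - A\|$ alone says little about the quality of the resulting policy.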


5.2 On the Locality Preserving NMF Belief Compression 


Theocharous and Mahadevan (2010) proposed an extension to the O-NMF
method, which is formulated as follows. Firstly, the sampled belief set $B$ is
sub-sampled into a smaller set $B'$ by only including those points in the original $B$
that are at least $\delta$ apart in terms of Euclidean distance, where $\delta$ is a pre-fixed
threshold. Then, in finding a factorisation of $B'$, the original Frobenius norm
loss is replaced by an unnormalised KL-divergence loss between $B'$ and $F\tilde{B}'$.
After this, an extra risk is introduced, which measures the distance (with
respect to symmetric KL-divergence) between each pair of the compressed
beliefs weighted by a neighbourhood graph among the original belief points.
The neighbourhood graph is constructed by connecting every belief point with
its $K$-nearest neighbours (KNN). Finally, similar to O-NMF, F'^ can be
obtained by approximating I ≈ FF^. The insight behind this algorithm is that
the neighbourhood graph inspired risk function forces two compressed beliefs
to be close to each other if their corresponding original beliefs are so. Hence,
it is named “locality preserving” NMF (LP-NMF) belief compression.
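The two pre-processing steps described above, sub-sampling beliefs that are at least $\delta$ apart and building the $K$-nearest-neighbour graph, could be sketched as follows. This is only an illustration under stated assumptions: beliefs are stored as the columns of `B`, the sub-sampling is a greedy single pass (the selection order used in the original work is not specified here), $K$ is smaller than the number of beliefs, and `subsample_beliefs` and `knn_graph` are hypothetical names.

```python
import numpy as np


def subsample_beliefs(B, delta):
    """Greedily keep columns of B that are >= delta (Euclidean) from all kept ones."""
    kept = []
    for j in range(B.shape[1]):
        if all(np.linalg.norm(B[:, j] - B[:, i]) >= delta for i in kept):
            kept.append(j)
    return B[:, kept]


def knn_graph(B, K):
    """Symmetric 0/1 adjacency matrix linking each belief to its K nearest neighbours."""
    m = B.shape[1]
    sq = (B ** 2).sum(axis=0)
    # Pairwise squared Euclidean distances between columns of B.
    D = sq[:, None] + sq[None, :] - 2.0 * (B.T @ B)
    np.fill_diagonal(D, np.inf)            # a point is not its own neighbour
    nearest = np.argsort(D, axis=1)[:, :K]
    W = np.zeros((m, m))
    W[np.repeat(np.arange(m), K), nearest.ravel()] = 1.0
    return np.maximum(W, W.T)              # symmetrise


B = np.array([[1.0, 1.0, 0.0, 0.5],
              [0.0, 0.0, 1.0, 0.5]])
B_sub = subsample_beliefs(B, delta=0.5)    # the duplicate second column is dropped
W = knn_graph(B_sub, K=1)
```

The resulting `W` would then weight the pairwise distances between compressed beliefs in the locality-preserving risk term.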

According to our discussion in Section 5.1, LP-NMF shares both the
advantages and the drawbacks of the O-NMF method. In addition, the locality
preserving property is only intuitively motivated, as the initial closeness
among the belief points will soon be lost during value iteration due to the
recursive influence of $T^{a,z}$. Furthermore, KL-divergence does not tend to be a good
measure for selecting the linear compression matrix $F$. As shown in (Poupart
and Boutilier 2000), the quality of the policies resulting from approximate
belief monitoring can be significantly lower than the original policy even when
the KL-divergence remains fairly small, whilst policy quality can be unaffected
when KL-divergence is large.


6 Projective NMF for Belief Compression 

To address the drawbacks of O-NMF belief compression, we propose a novel
projective NMF belief compression algorithm motivated by Theorem 2 and
Lemma 2.

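Again, we start from an ideal case. Assume that for a POMDP, the reachable
belief space $\mathcal{B}$ is low-rank (say $\mathrm{rank}(\mathcal{B}) = k \ll n$). Let $B$ be a sampled
set of beliefs, and $\mathcal{B} \subset \mathrm{span}(B)$ (i.e. $\mathrm{rank}(B) = k$). If we can find a matrix
$F \in \mathbb{R}^{n \times k}$ and $F \ge 0$, such that:

$$B = FF^\top B \quad \text{and} \quad \|FF^\top\|_\infty < \frac{1}{\eta} \tag{21}$$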

letting = F^ and substituting it into Eq. ^ gives a lossless compression of 
the original POMDP. The correctness of this proposition is easy to check. As
$B$ contains a linear basis of $\mathcal{B}$, there exists a weight vector $w$ such that $b = Bw$,
$\forall b \in \mathcal{B}$. Therefore, $FF^\top b = FF^\top Bw = Bw = b$, $\forall b \in \mathcal{B}$ holds. Moving to
the imperfect case, if for a subset of the sampled beliefs $\hat{B} \subset B$, $\hat{B} = FF^\top \hat{B}$
is achievable, then $\forall b \in \mathrm{span}(\hat{B})$, $b = FF^\top b$. Hence, the insight here is that
we attempt to reduce the compression error directly from the value function,
instead of minimising a loose error bound as O-NMF belief compression does.
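The ideal case can be checked numerically on a toy problem. The example below is a hand-built sketch, not taken from the experiments: the sizes, the discount factor, the matrix `F` (nonnegative, with orthonormal columns of disjoint support) and the helper `belief` are all illustrative assumptions, chosen so that every belief lies in the span of the columns of `F`.

```python
import numpy as np

n, k, eta = 6, 2, 0.95

# Nonnegative F whose two columns have disjoint support and unit length.
F = np.zeros((n, k))
F[:3, 0] = 1.0 / np.sqrt(3.0)
F[3:, 1] = 1.0 / np.sqrt(3.0)
P = F @ F.T                                  # the projection F F^T


def belief(p):
    """Toy belief: mass p spread evenly over states 0-2, mass 1-p over states 3-5."""
    return np.concatenate([np.full(3, p / 3.0), np.full(3, (1.0 - p) / 3.0)])


# Sampled belief matrix B (beliefs as columns) spanning the toy belief space.
B = np.column_stack([belief(0.2), belief(0.7)])
reconstructs_samples = np.allclose(P @ B, B)                  # B = F F^T B
norm_ok = np.abs(P).sum(axis=1).max() < 1.0 / eta             # ||F F^T||_inf < 1/eta

# An unseen belief b = B w is then compressed and recovered without loss.
w = np.array([0.4, 0.6])
b = B @ w
b_tilde = F.T @ b                            # k-dimensional compressed belief
lossless = np.allclose(F @ b_tilde, b)
```

Here `b` was never part of the sample, yet it is recovered exactly because it is a linear combination of the sampled beliefs.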

Intuitively, the optimisation problem to seek the $F$ for the proposed method
could be formulated as:

$$\min_{F \ge 0} \|B - FF^\top B\|_1 \quad \text{s.t.} \quad \|FF^\top\|_\infty < \frac{1}{\eta} \tag{22}$$

However, we suggest using the Frobenius norm instead of the $L_1$/$L_\infty$ matrix
norms for efficiency purposes, as minimising the latter requires repeatedly
solving linear programming problems, which is computationally much more
expensive. Hence, the problem becomes:

$$\min_{F \ge 0} \frac{1}{2}\|B - FF^\top B\|_F^2 + \frac{\lambda}{2}\|FF^\top\|_F^2 \tag{23}$$
where $\lambda$ is a regularisation coefficient to approximately control the scale of
$FF^\top$. Our preliminary experiments indicate that $\lambda$ can be empirically selected
and that $\eta\|FF^\top\|_\infty$ being slightly greater than 1 usually does not affect the
convergence of value iteration in practice.

Eq. (23) forms a regularised projective NMF problem. As suggested by
Yang and Oja (2010), this class of problems is convex and thus can be solved
via gradient descent as follows. Define the objective function:

$$g(F) = \frac{1}{2}\|B - FF^\top B\|_F^2 + \frac{\lambda}{2}\|FF^\top\|_F^2 \tag{24}$$

Then the gradient of $g$ with respect to $F$ is given by:

$$\frac{\partial g}{\partial F_{i,j}} = -2(BB^\top F)_{i,j} + (FF^\top BB^\top F)_{i,j} + (BB^\top FF^\top F)_{i,j} + 2\lambda(FF^\top F)_{i,j} \tag{25}$$
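The gradient in Eq. (25) can be sanity-checked against central finite differences. The sketch below assumes the objective $g(F) = \frac{1}{2}\|B - FF^\top B\|_F^2 + \frac{\lambda}{2}\|FF^\top\|_F^2$ with beliefs stored as the columns of `B`; the function names, the step `h` and the random test sizes are illustrative choices only.

```python
import numpy as np


def objective(F, B, lam):
    """g(F) = 1/2 ||B - F F^T B||_F^2 + lam/2 ||F F^T||_F^2."""
    P = F @ F.T
    return (0.5 * np.linalg.norm(B - P @ B, "fro") ** 2
            + 0.5 * lam * np.linalg.norm(P, "fro") ** 2)


def gradient(F, B, lam):
    """Closed-form gradient of g, term by term as in Eq. (25)."""
    M = B @ B.T
    P = F @ F.T
    return -2.0 * (M @ F) + P @ M @ F + M @ P @ F + 2.0 * lam * (P @ F)


def max_gradient_error(F, B, lam, h=1e-6):
    """Largest entrywise gap between the closed form and central differences."""
    G = gradient(F, B, lam)
    worst = 0.0
    for i in range(F.shape[0]):
        for j in range(F.shape[1]):
            E = np.zeros_like(F)
            E[i, j] = h
            numeric = (objective(F + E, B, lam) - objective(F - E, B, lam)) / (2.0 * h)
            worst = max(worst, abs(numeric - G[i, j]))
    return worst


rng = np.random.default_rng(0)
F0 = rng.random((5, 2))
B0 = rng.random((5, 8))
err = max_gradient_error(F0, B0, lam=0.3)   # expected to be tiny if Eq. (25) holds
```

A small value of `err` indicates that the coefficients $\frac{1}{2}$ and $\frac{\lambda}{2}$ in the objective agree with the factors $2$ and $2\lambda$ in the gradient.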


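After this, we can construct the additive update rule for minimisation:

$$F_{i,j} \leftarrow F_{i,j} - \zeta_{i,j}\frac{\partial g}{\partial F_{i,j}} \tag{26}$$

where $\zeta_{i,j}$ is a positive step size. In order to keep $F_{i,j}$ nonnegative, $\zeta_{i,j}$
can be selected as:

$$\zeta_{i,j} = \frac{F_{i,j}}{(FF^\top BB^\top F)_{i,j} + (BB^\top FF^\top F)_{i,j} + 2\lambda(FF^\top F)_{i,j}} \tag{27}$$

Substituting Eq. (27) into Eq. (26), we have:

$$F_{i,j} \leftarrow F_{i,j}\,\frac{2(BB^\top F)_{i,j}}{(FF^\top BB^\top F)_{i,j} + (BB^\top FF^\top F)_{i,j} + 2\lambda(FF^\top F)_{i,j}} \tag{28}$$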


We will refer to this method as P-NMF belief compression in the remainder
of this paper.
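A minimal NumPy sketch of the multiplicative update in Eq. (28) is given below. It is not the Matlab implementation used in the experiments. The random initialisation, the iteration count, the small constant `eps` guarding the division, and the choice to return the best iterate seen (rather than the last one) are all assumptions added here for robustness, and the function names are hypothetical. No claim about convergence is made.

```python
import numpy as np


def pnmf_objective(F, B, lam):
    """g(F) = 1/2 ||B - F F^T B||_F^2 + lam/2 ||F F^T||_F^2."""
    P = F @ F.T
    return (0.5 * np.linalg.norm(B - P @ B, "fro") ** 2
            + 0.5 * lam * np.linalg.norm(P, "fro") ** 2)


def pnmf_update(F, M, lam, eps=1e-12):
    """One multiplicative step of Eq. (28), with M = B B^T precomputed."""
    MF = M @ F
    FtF = F.T @ F
    denom = F @ (F.T @ MF) + MF @ FtF + 2.0 * lam * (F @ FtF)
    return F * (2.0 * MF) / (denom + eps)   # eps is an added numerical guard


def pnmf_fit(B, k, lam=0.1, iters=100, seed=0):
    """Run the update and return the iterate with the lowest objective value."""
    rng = np.random.default_rng(seed)
    F = rng.random((B.shape[0], k)) + 0.1   # strictly positive start (assumption)
    M = B @ B.T
    best_F, best_g = F, pnmf_objective(F, B, lam)
    for _ in range(iters):
        F = pnmf_update(F, M, lam)
        g = pnmf_objective(F, B, lam)
        if g < best_g:
            best_F, best_g = F, g
    return best_F, best_g


rng = np.random.default_rng(1)
B = rng.random((6, 20))
B /= B.sum(axis=0)                          # columns are beliefs summing to one
F, g = pnmf_fit(B, k=2)
b_tilde = F.T @ B[:, 0]                     # compressed version of the first belief
```

Because every factor in the update is nonnegative when `B` and the initial `F` are, nonnegativity of `F` is preserved without any explicit projection step.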


7 Experimental Results 

In previous literature, belief compression is usually conceptually demonstrated 
on small-scale benchmark problems. Here, we are interested in seeing its actual 
performance on POMDPs with large numbers of states. We use empirical 
methods to investigate the following two questions: 

1. whether P-NMF could improve over O-NMF and LP-NMF in policy quality
as expected, since it is more focused on error reduction in the value
function;

2. whether belief compression in general is a preferable technique to
state-of-the-art POMDP solvers, or under what situation it is so.
7.1 Experiments on Benchmark Problems


The P-NMF, O-NMF and LP-NMF belief compression methods are now
compared on four POMDP problems, including UnderwaterNavigation
(Kurniawati et al 2008), LifeSurvey (Smith et al 2007), RockSample[7,8] (Smith
and Simmons 2004), and a spoken dialogue system problem (ComplexGoalDialog)
described in (Crook and Lemon 2011). Perseus (Spaan and Vlassis 2005) is
employed to solve the compressed POMDPs. In addition, SARSOP (Kurniawati
et al 2008) is used to give a
measure of the rewards that can be achieved for uncompressed problems, as
Perseus tends to be unable to solve a POMDP with more than a few thousand
states in acceptable time. SARSOP is a state-of-the-art PBVI POMDP solver
that achieves its efficiency by restricting itself to only sample belief points near
the subset of those reachable from the initial belief under optimal sequences
of actions, where the sampling region is controlled by heuristic exploration of
sampling paths. Hence, its performance here also gives us an insight into the
practical competitiveness of belief compression in solving large-scale POMDPs.

In the following experiments, the P-NMF, O-NMF and LP-NMF belief
compression algorithms are implemented in Matlab. The parameters for
P-NMF and LP-NMF are empirically tuned based on preliminary experiments.
For O-NMF, we follow Li et al's (2007) method of automatically selecting $\lambda$ to
enforce the orthogonal constraint (though the orthogonality is unachievable).
The compression and policy optimisation processes are based on 20,000 belief
points randomly sampled by Perseus, in all the tasks except RockSample[7,8]
where 100,000 beliefs are sampled to maintain proper performance of the
belief compression methods. The computing time for all algorithms is measured
using CPU time (seconds) on a computer with 2 × six-core 2.4GHz CPUs and
128GB memory. The compression levels $k$ are selected empirically in
consideration of both compression errors and computational complexities. We sample
1000 decision trajectories to compute an average reward for each policy
obtained in each task. The policy learning and reward sampling procedures are
repeated five times for each task to calculate a mean and standard deviation,
as summarised in Table 2.

It can be found that the proposed P-NMF method consistently outperforms
O-NMF and LP-NMF with respect to the rewards obtained in all the four
tasks. The performance of LP-NMF compares unfavourably to that of the other
two belief compression methods in most of the tasks, and is sometimes very
unstable, e.g. large variances in its average sampled rewards can be observed
in the UnderwaterNavigation problem. Note that, for RockSample[7,8] one can


The only update here is to add an extra initial state $s_0$ standing for the beginning of a
dialogue, for which the transition probabilities to the other states $s$ for all system actions
are set to be proportional to with self-transition eliminated, where $|s|$ stands for the
number of user goals contained in $s$. The purpose of such a modification is to yield more
natural conversations.
Task                              Algorithm  k    Reward               Time (P)  Time (C)
------------------------------------------------------------------------------------------
ComplexGoalDialog                 P-NMF      100  23.5±0.3             500       7.3 × 10^
(|S|: 4097, |A|: 23, |Z|: 49)     O-NMF      100  23.1±0.2             500       1.6 × 10^
                                  LP-NMF     100  13.2±0.3             500       4.2 × 10^3
                                  SARSOP     n/a  24.2±0.3             600       n/a
LifeSurvey                        P-NMF      200  89.9±5.8             1,500     1.9 × 10"*
(|S|: 7841, |A|: 7, |Z|: 28)      O-NMF      200  73.3±12.9            600       6.5 × 10^
                                  LP-NMF     200  (-8.1 ± 1.2) × 10^   150       1.4 × 10"*
                                  SARSOP     n/a  105.7±1.1            200       n/a
UnderwaterNavigation              P-NMF      100  530.3±37.8           180       2.9 × 10**
(|S|: 2653, |A|: 6, |Z|: 103)     O-NMF      100  354.5±31.4           300       1.0 × 10*^
                                  LP-NMF     100  489.7±283.2          300       0.9 × 10*^
                                  SARSOP     n/a  733.5±5.0            160       n/a
RockSample[7,8]                   P-NMF      100  13.2±0.3             6,000     2.9 × 10*>
(|S|: 12545, |A|: 13, |Z|: 2)     O-NMF      100  7.4±0.0              60        4.4 × 10"*
                                  LP-NMF     100  7.4±0.0              70        3.0 × 10*^
                                  SARSOP     n/a  20.9±0.5             130       n/a

Table 2 Performance of belief compression on benchmark problems. Time (P) and Time
(C) stand for policy optimisation time and compression time (averaged in five trials),
respectively.


regard O-NMF and LP-NMF as unable to solve the problem properly, as only
1 $\alpha$-vector is produced in their policies.

Nevertheless, none of the above belief compression methods can be guaranteed
to work on all POMDPs. For example, we also apply them to another
two benchmark problems, Homecare (Kurniawati et al 2008) and Fourth
CIT (Cassandra 1998), where they all fail to produce a meaningful policy.
The failures could partially be due to the limitations of Perseus, e.g. for Fourth
CIT ($|S| = 1052$, $|A| = 4$, $|Z| = 28$), Perseus itself fails to solve it, even
though 100,000 belief points are sampled. Another reason could be the lack
of low-rankness in the reachable belief space of a problem, e.g. for Homecare
($|S| = 5408$, $|A| = 9$, $|Z| = 928$), if we sample a $B$ of 10,000 belief points and
analyse its singular values, it can be found that there are about 2,000 of them
with non-trivial values. Therefore, a too low-dimensional compression cannot
sufficiently approximate the problem, whilst a compression with too many
dimensions is computationally intractable for both Perseus and the compression
algorithm itself.
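The singular-value analysis mentioned above can be reproduced in a few lines. The sketch below is illustrative only: the relative tolerance `rel_tol` that defines a "non-trivial" singular value is an assumed choice, `effective_rank` is a hypothetical name, and the synthetic belief matrix merely stands in for beliefs sampled from a real POMDP.

```python
import numpy as np


def effective_rank(B, rel_tol=1e-6):
    """Count singular values of B larger than rel_tol times the largest one."""
    s = np.linalg.svd(B, compute_uv=False)   # singular values, descending
    return int((s > rel_tol * s[0]).sum())


# Synthetic stand-in: 200 beliefs over 50 states mixed from 5 basis distributions.
rng = np.random.default_rng(0)
basis = rng.random((50, 5))
weights = rng.random((5, 200))
B = basis @ weights
B /= B.sum(axis=0)                           # normalise each column to a belief

k_hat = effective_rank(B)
```

If `k_hat` is close to the number of states, as reported for Homecare, a low-dimensional linear compression is unlikely to approximate the problem well.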

Moreover, considering either time expense or policy quality, belief
compression in general tends to be less competitive than SARSOP, which motivates
further investigation in the next section.


7.2 Belief Compression vs. the State-of-the-art: A Further Comparison 


Although SARSOP (Kurniawati et al 2008) demonstrates its impressive
efficiency and effectiveness in the above experiments, it can usually be noticed
that its intermediate belief sampling procedure (a heuristic search-based
sampling strategy) results in considerably higher space complexities than standard
PBVI algorithms (such as Perseus). This situation tends to be worse for
problems with more actions and observations. Hence, it suggests a potential
advantage for belief compression, as SARSOP may easily run out of memory for
some large-scale problems. To further explore the relative merits of SARSOP
and belief compression methods, we investigate two further dialogue problems
as follows.


7.2.1 POMDP Construction 


Firstly, a hand-crafted rule-based spoken dialogue system was set up in a
laboratory environment. The system’s domain is restaurant search and it supports
complex user goals (Crook and Lemon, 2010). Each simple goal (which forms
part of a complex user goal) contains two pieces of information (2 attributes:
food-type and location, with 4 and 3 values respectively). The combination of
the attribute values forms 4 × 3 = 12 distinct goals. Each state of the system
is drawn from the power-set of the 12 possible goals, i.e. $2^{12} = 4096$ states in
total. We define 64 system actions and 807 user observations, with respect to
different expressions of the state information. After this, 161 dialogues were
collected from volunteers, and were manually transcribed and annotated. Then
we used the collected data to train a POMDP. As the rule-based system
assumes the user goal to be unchangeable during a dialogue, we preserve this
setting in the first version of the POMDP (called “Dialog/Identity”) by setting
the transition matrices to the identity matrix for all system actions, except
those for the initial state $s_0$. (For the POMDP model trained here we take
the state with an empty goal set as $s_0$.) For $s_0$ the transition probabilities to
other states for all actions are defined to be proportional to the probability
mass of a Poisson distribution over the number of goals contained in the
target state. A non-zero reward is only given to those actions that present goal
information to the user, where $R(s, a) = 10$ if the information presented by $a$
is fully contained in $s$, $R(s, a) = -10$ otherwise. The observation probabilities
are modelled using the exponential family, for which the parameter is trained
by maximising the regularised log-likelihood on the data. To be concrete, we
use feature vectors to represent the states, actions and observations, as $\phi_s(s)$,
$\phi_a(a)$ and $\phi_z(z)$ respectively, where binary values are used to indicate the
occurrences of certain attribute values, and extra fields are introduced for actions
and observations indexing the dialogue act type. Then we let the joint feature
representation of a $(z, s, a)$-tuple be the tensor product of the corresponding
individual feature vectors as $\phi(z, s, a) = \phi_s(s) \otimes \phi_a(a) \otimes \phi_z(z)$, and formulate
the observation probabilities as

$$\Omega(z|s, a) = \frac{\exp[w^\top \phi(z, s, a)]}{\sum_{z'} \exp[w^\top \phi(z', s, a)]} \tag{29}$$


where w is the parameter to be trained on the data. Finally, we eliminate 
those observation probabilities below the threshold 10“® and re-normalise the 
distributions, to achieve a sparser problem for space efficiency purposes. 
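The observation model of Eq. (29), followed by the thresholding and re-normalisation step, could be sketched as below. The feature dimensions, the weight vector and the value passed as `threshold` are placeholders rather than the trained quantities, the function names are hypothetical, and the threshold is assumed to be smaller than $1/|Z|$ so that at least one observation always survives.

```python
import numpy as np


def joint_feature(phi_s, phi_a, phi_z):
    """Tensor (Kronecker) product phi_s (x) phi_a (x) phi_z as a flat vector."""
    return np.kron(np.kron(phi_s, phi_a), phi_z)


def observation_probs(w, phi_s, phi_a, Phi_z, threshold):
    """Omega(. | s, a) for every observation; row i of Phi_z is phi_z(z_i)."""
    scores = np.array([w @ joint_feature(phi_s, phi_a, pz) for pz in Phi_z])
    scores -= scores.max()          # numerical stabilisation; cancels in the ratio
    p = np.exp(scores)
    p /= p.sum()                    # Eq. (29)
    p[p < threshold] = 0.0          # eliminate tiny probabilities for sparsity
    return p / p.sum()              # re-normalise the distribution


# Toy sizes: 2 state features, 2 action features, 3 observations with 2 features.
phi_s = np.array([1.0, 0.0])
phi_a = np.array([1.0, 0.0])
Phi_z = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])

w = np.zeros(8)
w[0] = 20.0                         # strongly favours the first observation
p = observation_probs(w, phi_s, phi_a, Phi_z, threshold=1e-6)
```

With these toy weights the two unlikely observations fall below the threshold and are removed, leaving a sparse distribution concentrated on the first observation.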
Task                              Algorithm  k    Reward  Time (P)  Time (C)
------------------------------------------------------------------------------
Dialog/Identity                   P-NMF      100  41.3    3,000     3.0 × 10^
(|S|: 4096, |A|: 64, |Z|: 807)    SARSOP     n/a  49.0    2,712     n/a
Dialog/Browsing                   P-NMF      100  23.1    3,000     3.0 × 10""
(|S|: 4096, |A|: 64, |Z|: 807)    SARSOP     n/a  -       ∞         n/a

Table 3 SARSOP vs belief compression on two challenging dialogue problems. P-NMF
compressed problems are solved using Perseus, with 100,000 sampled belief points for
compression and policy optimisation.


7.2.2 Results


The performance of SARSOP and P-NMF belief compression for this problem
is compared in the upper half of Table 3 (Dialog/Identity), where SARSOP
runs out of memory within its first 30 iterations (after the initialisation step),
but a promising policy is still obtained. A possible reason for this could be the
identity transition matrices, which make the problem converge easily.
Therefore, we slightly modified the previous setting to provide a further realistic
challenge, in the problem “Dialog/Browsing”. In this second version of the
POMDP, instead of keeping the user goal fixed, we assume that a user will be
able to request alternative goals when their current goal is correctly presented
by the system, and this process can last for an infinite number of turns (i.e. an
infinite-horizon planning problem). This corresponds to dialogues where users
are browsing through the possible entities (e.g. find out what Thai restaurants
there are, and then search for the closest restaurant).

To enable such a setting, we re-define those transition probabilities $T(\cdot|a, s)$
that have $R(s, a) = 10$ to be identical to $T(\cdot|a, s_0)$. The performance of
SARSOP and the P-NMF method on this modified POMDP (Dialog/Browsing) is
shown in the lower half of Table 3. This time, SARSOP fails to finish in
acceptable time, as it takes more than 15 hours to run each iteration when initialising
the fast informed bound (Hauskrecht, 2000). In contrast, P-NMF is far more
efficient in this case. Furthermore, by looking into the
sampled dialogue trajectories of the compressed problem, we found that the
ratio between its correct decisions (reward 10) and incorrect decisions (reward
-10) is approximately 3:1, which suggests a reasonable quality of the policy.
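The re-definition of the transition probabilities for Dialog/Browsing amounts to copying the initial-state row into every state-action pair that earns the positive reward. The following sketch assumes an array layout `T[a, s, s']` for the transition probabilities and `R[s, a]` for the rewards, with $s_0$ at index 0; the layout, the function name and the toy sizes are illustrative assumptions.

```python
import numpy as np


def make_browsing_transitions(T, R, s0=0, success_reward=10.0):
    """Return a copy of T where T(.|a, s) = T(.|a, s0) whenever R(s, a) = 10."""
    T_new = T.copy()
    s_idx, a_idx = np.nonzero(R == success_reward)
    # After a correct presentation, the user goal is re-drawn as at dialogue start.
    T_new[a_idx, s_idx, :] = T[a_idx, s0, :]
    return T_new


# Toy model: 3 states, 2 actions, identity transitions except from s0.
T = np.stack([np.eye(3), np.eye(3)])
T[:, 0, :] = np.array([0.0, 0.5, 0.5])
R = np.zeros((3, 2))
R[1, 0] = 10.0      # action 0 correctly presents the goal in state 1
R[2, 0] = -10.0     # action 0 is incorrect in state 2

T_browse = make_browsing_transitions(T, R)
```

All other rows keep the identity transitions of Dialog/Identity, so the user goal still only changes after a correct presentation.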


Here we set a 100GB memory limit for SARSOP to reserve sufficient resource for normal
operating system activities.


8 Conclusion

This paper introduces a theoretical framework to analyse linear belief
compression techniques, under which the deficiencies of three existing algorithms
are presented. The findings indicate that policy quality reduction resulting
from the compression can be relieved if those deficiencies are properly revised,
as demonstrated by a new proposed P-NMF model. However, the overall
performance of belief compression techniques tends to be less competitive in
comparison with a state-of-the-art POMDP solver, such as SARSOP. Nevertheless, we
show that under particular situations SARSOP may fail due to time or space
complexities, whilst belief compression could provide a feasible solution. A
further question posed here would be whether there exists a way of combining
SARSOP and belief compression to achieve further efficiencies. Unfortunately,
our preliminary answer is negative, as various underlying theories (e.g. the
fast informed bound) in SARSOP’s heuristics rely on beliefs being
distributions. The possibility of applying alternative heuristics in solving compressed
problems still requires further investigation.